Team Spark

Authors: Anushka Vuppala, Pranav Chandaliya

Problem Statement

Having access to clean and safe drinking water is vital for good health and is considered a fundamental human right. It is also an integral part of any successful health protection policy, and is crucial for promoting both national and local development. In certain areas, research has demonstrated that investing in water supply and sanitation can have a positive economic impact, as the benefits of reduced health problems and associated healthcare expenses can outweigh the costs of implementing these interventions.

Using this dataset, we aim to predict whether water is safe to drink based on its chemical features, such as pH level, hardness, chlorine content, and other relevant parameters.


Data Source : Kaggle

https://www.kaggle.com/datasets/adityakadiwal/water-potability?datasetId=1292407

Data Description

  1. pH value: Indicator of acidic or alkaline condition of water status.

  2. Hardness: Amount of calcium and magnesium salts.

  3. Solids (Total dissolved solids - TDS): Amount of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc.

  4. Chloramines: Amount of chlorine and chloramines.

  5. Sulfate: Sulfates are naturally occurring substances that are found in minerals, soil, and rocks.

  6. Conductivity: Electrical conductivity (EC) actually measures the ionic process of a solution that enables it to transmit current.

  7. Organic_carbon: Total Organic Carbon (TOC) is a measure of the total amount of carbon in organic compounds in pure water.

  8. Trihalomethanes: THMs are chemicals which may be found in water treated with chlorine.

  9. Turbidity: The turbidity of water is used to indicate the quality of waste discharge with respect to colloidal matter.

  10. Potability: Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.

Contents

- Loading of dataset, libraries and basic statistics of the data and checking null values

- EDA

- Feature Selection

- Data Preprocessing

- Testing different imputation techniques (on Logistic Regression)

- Predictive Modeling - SVM & Random Forest

- Boosting model comparison: XGBoost, CatBoost, LightGBM

- Model interpretability using SHAP (SHapley Additive exPlanations)

- Understanding model prediction

- Sample prediction from model

- Efficient probability Threshold Cut-off for better recall

Importing all the necessary libraries

pH has about 14% of its values missing and Sulfate about 24%.

Dropping these rows is not a good option, as we would lose a lot of data.

Let's decide how to handle them on the basis of the EDA.
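The missing-value check described above can be sketched as follows. This is a minimal illustration on a small synthetic table, not the notebook's original code: the column names come from the data description, while the values and the number of blanked entries are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle table (values are invented for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, 200),
    "Sulfate": rng.normal(330.0, 40.0, 200),
    "Hardness": rng.normal(196.0, 30.0, 200),
})

# Blank out some entries to mimic missing measurements.
df.loc[rng.choice(200, 30, replace=False), "ph"] = np.nan
df.loc[rng.choice(200, 48, replace=False), "Sulfate"] = np.nan

# Share of missing values per column, in percent, largest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(1))
```

On the real data, the same last two lines applied to the loaded DataFrame would give the per-column percentages quoted above.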

EDA

Looking at the above pie chart, we can conclude that about 61% of the samples in the dataset are non-potable water and the remaining 39% are potable.

Looking at the above plot, we can say that the pH of both potable and non-potable water lies between 4 and 10, with the ideal value lying between 5.5 and 8.5.

Looking at the above plot, we can say that the hardness of both potable and non-potable water lies between 100 and 300 ppm. Hard water is generally defined as above 120 ppm, so it is fair to say that we drink hard water, which contains minerals.

From the above plot, we observe that solids usually lie between 0 and 50,000 ppm, whereas the prescribed limit is between 500 and 1,000 ppm. The drinking water available to us therefore contains a high level of dissolved solids, which can be deemed harmful.

The level of chlorine in the water, as shown above, lies mostly in the range 4 to 10, while the prescribed limit for chlorine in drinking water is 4 ppm. Again, this suggests that the water we drink has slightly elevated levels of chlorine.

Looking at the sulfate concentration, we can say that it also leans towards the higher side. Some geographical locations have sulfate levels between 500 and 1,000 ppm. Compared with freshwater (3 to 30 ppm of sulfate), this is a huge difference.

From the above plot of electrical conductivity (EC), we can say that the water bodies in the dataset have a slightly high EC (> 400 µS/cm).

Based on the amount of organic carbon seen in the water bodies in question, we can conclude that very few of them have high organic carbon (> 25 ppm), which is considered unfit for drinking.

The level of trihalomethanes in the water bodies generally lies between 20 and 100 ppm. Water with trihalomethanes above 80 ppm is generally considered unsafe for drinking.

The turbidity in water safe for drinking should not be more than 5 NTU. Based on the above plot, we can say that about 10% of the water bodies taken in this dataset have slightly higher amounts of turbidity.

Based on the above heatmap, very few chemical features have a high correlation with Potability. The highest in comparison to the others is Sulfate, which is negatively correlated with Potability.
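The numbers behind such a heatmap can be read off directly by correlating every feature with the target. The sketch below assumes a DataFrame with a `Potability` column and uses synthetic values for three of the features, so the printed correlations are not the project's results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; only the column names follow the dataset.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, n),
    "Sulfate": rng.normal(330.0, 40.0, n),
    "Solids": rng.normal(22000.0, 8000.0, n),
})
df["Potability"] = rng.integers(0, 2, n)

# Pearson correlation of each feature with the target,
# ordered by absolute strength so the strongest relation comes first.
corr = df.corr()["Potability"].drop("Potability")
corr = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr.round(3))
```

Pearson correlation only captures linear relations, so a feature with an optimum range (such as pH) can score close to zero here and still matter to a non-linear model.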

Feature selection techniques

Chi Square

From the above Chi-squared test, we compare the p-values of each chemical property and conclude that Solids and EC (Conductivity) have the lowest p-values and hence are considered the most significant.
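A Chi-squared ranking of this kind could be computed with scikit-learn's `chi2` scorer, as in the sketch below. Two assumptions are worth stating: the data is synthetic, and the features are min-max scaled first. `chi2` expects non-negative, count-like inputs and its statistic grows with a feature's magnitude, so large-valued columns such as Solids can look significant partly because of their scale; whether the original notebook scaled the features is not known.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic, non-negative stand-in features (values are invented).
rng = np.random.default_rng(2)
n = 400
y = rng.integers(0, 2, n)
X = pd.DataFrame({
    "Solids": rng.normal(22000.0, 8000.0, n).clip(min=0),
    "Conductivity": rng.normal(426.0, 80.0, n).clip(min=0),
    "Turbidity": rng.normal(4.0, 0.8, n).clip(min=0),
})

# Put every feature on the same [0, 1] scale before scoring.
X_scaled = MinMaxScaler().fit_transform(X)

# chi2 returns one statistic and one p-value per feature.
scores, p_values = chi2(X_scaled, y)
result = pd.DataFrame(
    {"chi2": scores, "p_value": p_values}, index=X.columns
).sort_values("p_value")
print(result)
```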

Random Forest Feature Importance

From the above feature selections, we can conclude that the chemical features with the highest importance is

After performing feature selection, it is evident that every feature is important for prediction.

We decided to use all features for modeling.
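For reference, a random-forest importance ranking like the one used in this section can be obtained from `feature_importances_`. The sketch uses synthetic data with a toy target that depends only on pH, so pH ranks first by construction; the hyperparameters are illustrative choices, not the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in features.
rng = np.random.default_rng(3)
n = 400
X = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, n),
    "Sulfate": rng.normal(330.0, 40.0, n),
    "Hardness": rng.normal(196.0, 30.0, n),
})

# Toy target: "potable" only when pH is close to neutral.
y = (X["ph"].sub(7.0).abs() < 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```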

Splitting features and target

Splitting the data before imputation to avoid data leakage
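The point of splitting first is that the imputation statistics are learned from the training rows only and then reused on the test rows. A minimal sketch, assuming an 80/20 stratified split and synthetic data (the split ratio and random seed are illustrative, not taken from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with missing pH values.
rng = np.random.default_rng(4)
n = 300
X = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, n),
    "Sulfate": rng.normal(330.0, 40.0, n),
})
X.loc[rng.choice(n, 45, replace=False), "ph"] = np.nan
y = rng.integers(0, 2, n)

# Split first, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # means are learned from the training rows only
X_test_imp = imputer.transform(X_test)        # the same means are reused, never re-fitted

print(imputer.statistics_)
```

Fitting the imputer on the full table instead would let test-set values influence the fill-in means, which is the leakage this step avoids.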

Data Preprocessing for Modeling

Let's try out different data imputation techniques and decide which one to go for:

  1. Mean Imputation (as we observed, pH and Sulfate have a normal distribution)
  2. KNN Imputation
  3. No Imputation (tree-based models can handle missing data by themselves)

Mean Imputation

Defining model evaluation function code to add readability
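One possible shape for such a helper is sketched below. The name `evaluate_model` and the choice of metrics are assumptions made for illustration; the notebook's actual function is not shown in this export. The toy data at the bottom only demonstrates the call.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_model(model, X_test, y_test):
    """Return headline metrics for a fitted binary classifier (hypothetical helper)."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_prob),
    }


# Toy usage on synthetic data: the first feature carries the signal.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X[:150], y[:150])
metrics = evaluate_model(model, X[150:], y[150:])
print(metrics)
```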

KNN Imputation

From the above two imputation techniques, we observe that the ROC AUC for mean imputation (0.45) is equal to that of KNN imputation (0.45). Hence, we could take either. Moving forward, for simplicity, we use the mean imputation technique.
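A comparison of this kind can be set up by swapping the imputer inside an otherwise identical pipeline. The sketch below is an assumption about how such a test might look, run on synthetic data with values removed at random; the scaler step and `n_neighbors=5` are illustrative choices, and the resulting scores are unrelated to the 0.45 reported above.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label is computed before values are removed.
rng = np.random.default_rng(6)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=400) > 0).astype(int)
X[rng.random(X.shape) < 0.15] = np.nan  # remove about 15% of the entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
}

auc = {}
for name, imputer in imputers.items():
    # Same scaler and classifier each time; only the imputer changes.
    pipe = make_pipeline(imputer, StandardScaler(), LogisticRegression())
    pipe.fit(X_train, y_train)
    auc[name] = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])

print(auc)
```

Keeping the imputer inside the pipeline also means it is fitted on the training split only, in line with the leakage concern raised earlier.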

We first used Logistic Regression as the base model and observed that it is not able to separate the classes. This indicates that we need to try more advanced modeling techniques, since logistic regression works well with linearly separable data but not with non-linear data.

Let's try out SVM, which can separate the data in a better way.

Modeling

SVM Classifier

SVM performed well compared to Logistic Regression, but its recall is low.
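The difference between the two models on non-linear data can be illustrated with a toy problem in which the positive class sits inside a circle, so no straight line separates it. This is a sketch under assumed settings (RBF kernel, default `C` and `gamma`, standardized inputs), not the notebook's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy non-linear problem: class 1 lies inside the unit circle.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 2))
y = (np.hypot(X[:, 0], X[:, 1]) < 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # decision_function gives a continuous score suitable for ROC AUC.
    auc[name] = roc_auc_score(y_test, model.decision_function(X_test))

svm_recall = recall_score(y_test, models["svm_rbf"].predict(X_test))
print(auc, round(svm_recall, 2))
```

SVMs are sensitive to feature scale, which is why the scaler is part of the pipeline here.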

Now we will try out tree-based models such as Random Forest and boosted trees, since tree-based models can separate non-linear data in a better way.

Random Forest Classifier

Random Forest didn't perform well; it has a low AUC.

Boosting Tree Models

XGBoost performed better

CatBoost performed best, with an AUC of 0.70

Model Selection

After evaluating every model, we observed that CatBoost performed best, as it gave us an AUC of 0.70

Model Interpretation

The above graph shows how the features impact the model

The graph illustrates how the features affect the model's output. It indicates that changing the pH scale alone does not make the water potable, suggesting that there is an optimum pH range for safe drinking water. Additionally, the graph suggests that the sulfate level should be kept low for drinkable water.

As a part of model interpretation

We will look at how the model predicts for the above feature values. It has predicted that the water is not potable, and we will try to interpret this prediction using SHAP.

By leveraging SHAP, we can infer that features such as ph, Hardness, Organic_carbon, and Sulfate push the prediction towards the water being suitable for consumption. However, Solids and Chloramines indicate that the water does not fall within the acceptable range for safe drinking water. As a result, the overall output of the model indicates that the water is not potable.
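To make the idea of per-feature contributions concrete, the sketch below computes exact Shapley values by brute force for a single sample. It does not use the SHAP library: `shapley_values` and `toy_model` are hypothetical helpers written for this illustration, the model is a hand-made scoring function over three features, and the background data is synthetic. What it shows is the additive property that SHAP plots rely on: the contributions sum to the difference between this sample's prediction and the average prediction.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x (hypothetical helper).

    Features outside a coalition take their values from the background rows,
    and the prediction is averaged over those rows.
    """
    n = len(x)

    def value(subset):
        data = background.copy()
        idx = list(subset)
        if idx:
            data[:, idx] = x[idx]  # fix the coalition's features at x's values
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def toy_model(data):
    """Made-up potability score: best near pH 7, lower for high sulfate and solids."""
    ph, sulfate, solids = data[:, 0], data[:, 1], data[:, 2]
    z = 0.8 * np.abs(ph - 7.0) + 0.01 * (sulfate - 330.0) + 0.0001 * (solids - 20000.0)
    return 1.0 / (1.0 + np.exp(z))


rng = np.random.default_rng(9)
background = np.column_stack([
    rng.normal(7.0, 1.5, 200),       # ph
    rng.normal(330.0, 40.0, 200),    # Sulfate
    rng.normal(20000.0, 8000.0, 200) # Solids
])
x = np.array([7.2, 300.0, 35000.0])  # one sample with very high solids

phi = shapley_values(toy_model, x, background)
baseline = toy_model(background).mean()
for name, contribution in zip(["ph", "Sulfate", "Solids"], phi):
    print(f"{name}: {contribution:+.3f}")
```

Brute force is only practical for a handful of features, since the number of coalitions doubles with each one; libraries such as SHAP use model-specific shortcuts instead.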

Efficient probability Threshold Cut-off for better recall

To make the model more useful, we can tune the probability threshold at which a sample is labeled potable. Our goal in this project is to identify potable water reliably, so we aim for a higher recall on the potable class while maintaining the model's accuracy. Note that lowering the threshold raises recall at the cost of more false positives, so high recall and the wish for no false positives have to be traded off against each other.

We can deduce that at a threshold of 0.34 we achieve a good recall of 75% while maintaining high accuracy. If we want to capture even more of the potable samples, we can opt for a threshold of 0.3, which yields 80% recall.
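The effect of moving the cut-off can be checked by converting predicted probabilities into labels at several thresholds and recomputing the metrics. The sketch below uses made-up probabilities rather than the model's output, so the printed numbers will not match the 75% and 80% quoted above; the thresholds 0.34 and 0.30 are taken from the text, and 0.50 is included as the usual default.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels and predicted probabilities (noisy but informative).
rng = np.random.default_rng(10)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.35 + 0.2 * y_true + rng.normal(0, 0.15, 1000), 0, 1)

rows = []
for threshold in (0.50, 0.34, 0.30):
    # A sample is labeled potable when its probability reaches the threshold.
    y_pred = (y_prob >= threshold).astype(int)
    rows.append((
        threshold,
        recall_score(y_true, y_pred),
        precision_score(y_true, y_pred, zero_division=0),
        accuracy_score(y_true, y_pred),
    ))

for t, rec, prec, acc in rows:
    print(f"threshold={t:.2f} recall={rec:.2f} precision={prec:.2f} accuracy={acc:.2f}")
```

Recall can only stay the same or rise as the threshold drops, because every sample labeled potable at a higher threshold is still labeled potable at a lower one; precision is the metric to watch for the accompanying increase in false positives.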